Implement and use CUDA graph plans #16548
Open
+315
−178
This PR implements the graph plan APIs for the CUDA backend, as well as code in ggml-backend.cpp to actually use the graph plan APIs when a backend supports them.
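For context, a minimal sketch of the graph plan lifecycle as exposed by the existing public ggml-backend API (create, compute, free); the wrapper function and its name are illustrative, and everything around graph construction and plan updates is elided:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Minimal sketch, assuming a backend that supports graph plans.
// ggml_backend_graph_plan_create/compute/free are the existing public API.
static enum ggml_status run_with_plan(ggml_backend_t backend, struct ggml_cgraph * cgraph, int n_runs) {
    ggml_backend_graph_plan_t plan = ggml_backend_graph_plan_create(backend, cgraph);

    enum ggml_status status = GGML_STATUS_SUCCESS;
    for (int i = 0; i < n_runs && status == GGML_STATUS_SUCCESS; ++i) {
        // On the CUDA backend this replays a captured CUDA graph instead of
        // launching each kernel individually.
        status = ggml_backend_graph_plan_compute(backend, plan);
    }

    ggml_backend_graph_plan_free(backend, plan);
    return status;
}
```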
The main functional improvement is support for CUDA graphs when the graph is split (e.g. for hybrid inference). Currently the graph update and reuse logic (ggml_backend_sched_update_plans) is a simple heuristic: previous plans are only updated when the number of splits and their corresponding backends are the same as in the previous run, as sketched below. As the benchmarks show, this uniformly accelerates hybrid inference tg performance by up to 30%.
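To make the reuse condition concrete, here is a minimal sketch of the check, assuming a simplified per-split record; the struct and function names are illustrative, not the actual ggml_backend_sched_update_plans implementation:

```cpp
#include <cstddef>
#include <vector>
#include "ggml-backend.h"

// Illustrative only: one entry per split from the previous run.
struct prev_split_plan {
    ggml_backend_t            backend; // backend that executed the split
    ggml_backend_graph_plan_t plan;    // plan created for that split
};

// Plans from the previous run are updated in place only when the split count
// and the backend of every split are unchanged; otherwise the scheduler
// falls back to rebuilding the plans from scratch.
static bool plans_reusable(const std::vector<prev_split_plan> & prev,
                           const std::vector<ggml_backend_t>  & cur_backends) {
    if (prev.size() != cur_backends.size()) {
        return false; // split count changed
    }
    for (size_t i = 0; i < prev.size(); ++i) {
        if (prev[i].backend != cur_backends[i]) {
            return false; // backend assignment changed
        }
    }
    return true;
}
```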
The CUDA graph execution code is also refactored and cleaned up. Two of the three original graph plan failure paths are removed: disable_due_to_failed_graph_capture and disable_due_to_too_many_updates. The former is removed because I found no code that ever sets it to true. The latter is removed because I currently have no idea what its semantics should be in a split-graph scenario, and dropping it does not seem to degrade performance at all. Interestingly, I found that on my rig, even repeatedly building a graph and then executing it only once is always faster than launching the kernels individually. I suspect this is why performance improved in the tests even for CUDA-only workloads, which this PR's optimization does not target. This of course needs to be verified on more hardware configurations.

Performance comparison:
RTX 5090 + 13700k, 128GB 6400 MT/s RAM